Source: Dr. P. Soundarapandian, M.D., D.M. (Senior Consultant Nephrologist), Apollo Hospitals, Managiri, Madurai Main Road, Karaikudi, Tamilnadu, India.

Creator: L. Jerlin Rubini (Research Scholar), Alagappa University. Email: jel.jerlin '@' gmail.com, Contact: +91-9597231281

Guided by: Dr. P. Eswaran, Assistant Professor, Department of Computer Science and Engineering, Alagappa University, Karaikudi, Tamilnadu, India. Email: eswaranperumal '@' gmail.com

Content:

  1. Load the Data

    • Import libraries
    • Load the datasets
  2. Overview of the Data

    • Descriptive Statistics
    • Missing Values
  3. Data Preparation

    • Data Cleaning
  4. Exploratory Data Analysis

    • Create list of columns by data type
    • Check the distribution of target class
    • Check the distribution of every feature
    • Check how different numerical features are related to the target class
    • Feature Encoding
  5. Model Building

    • Split X & y
    • Feature Scaling
    • Train Test split
    • Train Model
    • Model Prediction
    • Model Evaluation
    • Feature importance
  6. Improve Model

    • Handle Class Imbalance
    • Hyperparameter Tuning
    • Save the Final Model

Inputs

The notebook is designed so that you just need to plug in the input values given below and run the code. It will then run on its own and build the model as well.

1. Load the Data

In this section you will:

1.1. Import Libraries

Import all the libraries in the first cell itself

1.2. Load the datasets

Load the dataset using pd.read_csv()
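A minimal sketch of this step, using an inline CSV sample in place of the actual dataset file (in the notebook you would pass the real file path instead; the column names below are assumptions for illustration):

```python
import io
import pandas as pd

# Inline sample standing in for the CKD dataset file; in practice you
# would call pd.read_csv("path/to/dataset.csv") instead.
csv_data = """id,age,bp,sg,classification
0,48,80,1.020,ckd
1,7,50,1.020,ckd
2,62,80,1.010,notckd
"""

df = pd.read_csv(io.StringIO(csv_data))
print(df.shape)  # rows x columns
```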

2. Overview of the Data

Before attempting to solve the problem, it's very important to have a good understanding of data.

In this section you will:

2.1. Descriptive Statistics

As the name suggests, descriptive statistics describe the data. They give you information such as counts, means, standard deviations, minimums, maximums, and quartiles for numeric columns, and the number of unique values for categorical columns.

Let's understand the data we have.
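For example, pandas' `describe()` produces these summaries (the toy DataFrame below is illustrative, not the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [48, 7, 62, 48],
    "blood_pressure": [80, 50, 80, 70],
    "classification": ["ckd", "ckd", "notckd", "ckd"],
})

# Numeric summary: count, mean, std, min, quartiles, max
print(df.describe())

# Categorical summary: count, unique, top, freq
print(df.describe(include="object"))
```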

2.1.1. Parse metadata column names to rename the DataFrame columns

Convert some features to numeric types

2.2 Missing Values

Get the info about missing values in the dataframe
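A quick sketch of how to surface missing values with pandas (synthetic data for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [48, np.nan, 62],
    "blood_pressure": [80, 50, np.nan],
    "classification": ["ckd", "ckd", "notckd"],
})

# Count of missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct.round(1))
```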

3. Data Preparation

The data is not yet ready for model building. You need to process the data and make it ready for model building

In this section you will:

3.1. Data Cleaning

Machine Learning works on the idea of garbage in - garbage out. If you feed in dirty data, the results won't be good. Hence it's very important to clean the data before training the model.

Scikit-learn algorithms need missing values to be imputed, but XGBoost, LightGBM, etc. can handle missing values natively.

There are various ways to handle missing values. Some of them are:

  • Drop the rows or columns that contain missing values
  • Impute numeric columns with the mean or median
  • Impute categorical columns with the mode (most frequent value)

Here you can decide how you want to handle the missing data.
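One possible approach, sketched on synthetic data: median imputation for a numeric column and mode imputation for a categorical one (the column names are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 50.0],
    "red_blood_cells": ["normal", "abnormal", None, "normal"],
})

# Numeric column: fill with the median (robust to outliers)
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill with the mode (most frequent value)
df["red_blood_cells"] = df["red_blood_cells"].fillna(df["red_blood_cells"].mode()[0])

print(df.isnull().sum().sum())  # no missing values remain
```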

3.2. Feature Encoding

Encoding is the process of converting data from one form to another. Most of the Machine learning algorithms can not handle categorical values unless we convert them to numerical values. Many algorithm’s performances vary based on how Categorical columns are encoded.

There are a lot of ways in which you can encode categorical variables. Some of those are:

  • Label encoding
  • One-hot encoding
  • Ordinal encoding
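As an illustration, here are label encoding and one-hot encoding with scikit-learn and pandas (toy columns, not the real dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "appetite": ["good", "poor", "good"],
    "classification": ["ckd", "notckd", "ckd"],
})

# Label encoding: map each category to an integer (here ckd -> 0, notckd -> 1)
le = LabelEncoder()
df["classification"] = le.fit_transform(df["classification"])

# One-hot encoding: one binary column per category
df = pd.get_dummies(df, columns=["appetite"])
print(df.columns.tolist())
```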

4. Exploratory Data Analysis

Exploratory data analysis is an approach to analyzing or investigating data sets to find patterns and see whether any of the variables are useful in predicting the target variable. Visual methods are often used to summarise the data. Primarily, EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing tasks.

In this section you will:

4.1. Extract data types of columns

It's better to get the lists of columns by data type right at the start, so you won't have to write out column names manually when performing certain operations later.

Note: There might be some mismatch in the data types of the columns; in such cases you will have to correct them manually.

4.2 Check distribution of target class

You need to check the distribution of the target class: how many categories there are and whether it is balanced or not.
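For example, `value_counts()` gives both the raw counts and the class ratio (the counts below are made up for illustration):

```python
import pandas as pd

# Synthetic target column: 250 positive, 150 negative cases
y = pd.Series(["ckd"] * 250 + ["notckd"] * 150, name="classification")

counts = y.value_counts()            # absolute counts per class
ratios = y.value_counts(normalize=True)  # class proportions
print(counts)
print(ratios)
```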

4.3. Check the distribution of every feature

Dive deeper into correlations, since some of them can help determine the target class; they can be demonstrated by visualization.

Positive correlations:

  • specific_gravity → red_blood_cell_count, packed_cell_volume, haemoglobin
  • sugar → blood_glucose_random
  • blood_urea → serum_creatinine
  • haemoglobin → red_blood_cell_count, packed_cell_volume

Negative correlations:

  • albumin, blood_urea → red_blood_cell_count, packed_cell_volume, haemoglobin
  • serum_creatinine → sodium
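These relationships can be checked numerically with `DataFrame.corr()`. The tiny sample below is illustrative only, not real measurements; it just mimics the expected directions:

```python
import pandas as pd

df = pd.DataFrame({
    "haemoglobin": [15.4, 11.3, 9.6, 12.0, 14.8],
    "packed_cell_volume": [44, 38, 31, 36, 43],
    "serum_creatinine": [1.2, 0.8, 7.2, 1.9, 1.1],
})

# Pairwise Pearson correlation matrix
corr = df.corr()
print(corr.round(2))
```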

4.4. Check how different numerical features are related to the target class

5. Model Building

In this section you will:

5.1. Split X and y

Split the X and y dataset
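A minimal sketch, assuming the target column is named `classification` (toy data):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [48, 7, 62],
    "blood_pressure": [80, 50, 80],
    "classification": [1, 1, 0],
})

# X holds the features, y holds the target column
X = df.drop(columns=["classification"])
y = df["classification"]
print(X.shape, y.shape)
```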

5.2. Feature Scaling

It is a technique to standardize the x variables (features) present in the data to a fixed range. It needs to be done before training the model.

However, if you are using tree-based models, feature scaling is not required.
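For the non-tree case, a standardisation sketch with scikit-learn's `StandardScaler`, which rescales each feature to zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[48.0, 80.0], [7.0, 50.0], [62.0, 80.0]])

# Fit on the data and transform it in one step
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_scaled.std(axis=0).round(6))   # ~[1, 1]
```

In a real pipeline the scaler is fit on the training split only and then applied to the test split, to avoid leaking test-set statistics.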

5.3 Train - Test Split

Split the dataset in training and test set
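A sketch using scikit-learn's `train_test_split`; `stratify=y` keeps the class ratio identical in both splits, which matters for an imbalanced target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, balanced binary target
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```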

5.4 Train Model

Train the model on training data

5.5. Model Prediction

Get the predictions from the model on testing data
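Training and prediction sketched with a `RandomForestClassifier` on synthetic data (the notebook may use a different estimator):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(20, 3))

# Fit the model on the training split
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# Predict labels for the unseen test split
y_pred = clf.predict(X_test)
print(y_pred[:5])
```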

5.6. Model Evaluation

Get the evaluation metrics to evaluate the performance of model on testing data
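A sketch of common classification metrics (the label vectors below are made up for illustration):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

# Overall fraction of correct predictions
print("Accuracy:", accuracy_score(y_test, y_pred))

# Rows = true classes, columns = predicted classes
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall, and F1
print(classification_report(y_test, y_pred))
```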

5.7. Feature importance

Use hypothesis testing as the scoring metric for feature importance.
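One way to do this is scikit-learn's `SelectKBest` with the chi-squared test as the score function (synthetic data below; note that `chi2` requires non-negative feature values):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic non-negative features; only feature 2 drives the target
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 4)).astype(float)
y = (X[:, 2] > 5).astype(int)

# Score every feature with the chi-squared test, keep the top 2
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X, y)
print(selector.scores_.round(2))  # higher score = more informative
```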

6. Improve Model

The first model you make may not be a good one. You need to improve the model.

In the majority of classification problems, the target class is imbalanced, so you need to balance it in order to get the best modelling results.

In this section you will:

6.1 Handle Class Imbalance

Imbalanced classes are a common problem in machine learning classification, where there is a disproportionate ratio of observations in each class.

Most machine learning algorithms work best when the number of samples in each class are about equal. This is because most algorithms are designed to maximize accuracy and reduce error.

Here, you will upsample the minority class
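A sketch of upsampling with `sklearn.utils.resample` (toy data; class 0 plays the minority role here):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "feature": range(10),
    "classification": [1] * 7 + [0] * 3,  # 0 is the minority class
})

majority = df[df["classification"] == 1]
minority = df[df["classification"] == 0]

# Sample the minority class with replacement until both classes match
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced["classification"].value_counts())
```

Upsampling should be done only on the training split, after the train-test split, so that duplicated rows never leak into the test set.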

6.2. Hyperparameter Tuning

A hyperparameter is a parameter whose value is set before the learning process begins.

Hyperparameter tuning refers to the automatic optimization of the hyperparameters of an ML model.
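A sketch with scikit-learn's `GridSearchCV` on a toy problem (the grid values are placeholders, not recommendations):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification problem
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)

# Try every combination in the grid with 3-fold cross-validation
param_grid = {"n_estimators": [10, 25], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)  # the best-scoring combination
```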

6.3. Save the final model
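One common option is serialising the trained model with `pickle` (sketched below with an in-memory byte string; in the notebook you would write to a `.pkl` file with `pickle.dump`):

```python
import pickle
from sklearn.linear_model import LogisticRegression

# A tiny stand-in for the final tuned model
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Serialise the fitted model; with a file this would be
# pickle.dump(model, open("model.pkl", "wb"))
blob = pickle.dumps(model)

# Later: restore the model and predict without retraining
loaded = pickle.loads(blob)
print(loaded.predict([[2.5]]))
```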